ballot_measure_poll <- read.csv("~/Downloads/Exercise/ballot_measure_poll.csv")
individual_demographics_and_scores <- read.csv("~/Downloads/Exercise/individual_demographics_and_scores.csv")
precinct_level_election_results <- read.csv("~/Downloads/Exercise/precinct_level_election_results.csv")
str(ballot_measure_poll)
## 'data.frame': 1099 obs. of 29 variables:
## $ voter_id : int 0 1 2 3 4 5 6 7 8 9 ...
## $ support_initiative : chr "yes" "yes" "yes" "no" ...
## $ region : chr "west" "west" "east" "east" ...
## $ county : chr "duchess" "duchess" "llandilo" "cuya" ...
## $ education : chr "college_graduate" "college_graduate" "college_graduate" "college_graduate" ...
## $ ses : chr "mid_ses" "low_ses" "wealthy" "wealthy" ...
## $ ethnicity : chr "race_A" "race_A" "race_B" "race_A" ...
## $ ideology : chr "moderate" "liberal" "liberal" "conservative" ...
## $ kids : chr "yes" "no" "yes" "yes" ...
## $ pro_authoritarianism_score : int 69 65 73 77 76 76 81 75 70 85 ...
## $ pro_taxes_score : int 50 43 34 34 34 42 39 36 42 36 ...
## $ pro_gunrights_score : int 39 56 44 60 69 42 42 68 46 58 ...
## $ pro_healthcare_score : int 62 58 35 28 32 31 33 40 23 27 ...
## $ pro_immigrants_score : int 50 58 50 32 26 26 24 37 29 20 ...
## $ pro_supporting_the_poor_score : int 51 49 38 26 35 37 40 34 33 39 ...
## $ environmentalist_score : int 49 51 54 53 53 53 55 53 53 57 ...
## $ trust_in_institutions_score : int 46 46 46 46 46 46 46 46 46 46 ...
## $ economic_populist_score : int 54 45 40 23 29 31 27 31 21 32 ...
## $ pro_military_score : int 40 42 56 57 59 60 61 52 55 72 ...
## $ pro_regulation_score : int 54 51 60 56 51 62 62 49 63 58 ...
## $ traditionalist_score : int 38 35 52 49 47 55 55 46 50 62 ...
## $ compassionate_score : int 66 66 50 43 50 52 50 47 52 43 ...
## $ pro_free_trade_score : int 42 49 48 46 40 30 33 48 33 37 ...
## $ pro_globalism_score : int 49 54 49 40 42 38 39 47 34 39 ...
## $ pro_healthcare_women_score : int 71 61 51 41 43 44 44 53 42 41 ...
## $ pro_populism_score : int 31 38 37 30 37 30 29 36 25 37 ...
## $ presidential_election_turnout_score: int 15 NA 13 71 76 25 72 31 31 72 ...
## $ racial_resentment_score : int 55 49 67 77 75 78 77 69 78 83 ...
## $ pro_religious_freedom_score : int 36 41 57 52 55 50 52 50 50 59 ...
str(individual_demographics_and_scores) # This is huge
## 'data.frame': 3421697 obs. of 15 variables:
## $ precinct : chr "precinct__1287" "precinct__357" "precinct__347" "precinct__1318" ...
## $ probability_race_G : num 3 1 1 59 96 49 98 94 98 97 ...
## $ probability_race_P : num 0 0 0 31 2 48 1 2 1 0 ...
## $ probability_race_O : num 97 99 99 2 1 3 0 2 0 0 ...
## $ gender : int 0 1 1 0 1 0 1 0 0 0 ...
## $ age : num 50 47 90 54 80 51 72 41 48 50 ...
## $ partisan_score : num 40 98 99 10 27 55 10 98 0 91 ...
## $ turnout_score : num 52 74 60 12 94 83 96 97 2 97 ...
## $ probability_highest_education_high_school: num 44 64 80 71 74 45 13 23 16 15 ...
## $ support_tax_on_wealthy_score : num 65 80 69 79 79 81 58 87 27 83 ...
## $ support_progressive_taxation_score : num 12 95 31 8 86 54 47 95 9 62 ...
## $ support_cannabis_legalization_score : num 23 44 49 34 43 68 18 85 22 49 ...
## $ probability_income_over_100k : num 72 20 10 36 52 45 85 46 93 90 ...
## $ probability_children_in_household : num 89 84 18 52 12 80 15 89 83 60 ...
## $ support_trump_score : num 77 4 34 92 14 54 29 1 95 28 ...
str(precinct_level_election_results)
## 'data.frame': 1128 obs. of 11 variables:
## $ votes_for_candidate_I : num 1286 728 1287 1837 169 ...
## $ votes_for_candidate_U : num 2974 1120 480 1833 409 ...
## $ county : chr "county__2" "county__1" "county__1" "county__3" ...
## $ precinct : chr "precinct__3" "precinct__4" "precinct__5" "precinct__7" ...
## $ population : int 5164 2643 2856 4368 701 809 3995 3742 1179 3981 ...
## $ votes_for_president : num 4313 1892 1804 3702 584 ...
## $ registered_voters : num 4747 2351 2531 4052 670 ...
## $ all_ballot_measure_votes : num 4023 1839 1739 3523 576 ...
## $ total_ballots : num 4361 1899 1824 3744 586 ...
## $ votes_against_ballot_measure: num 2216 861 500 1767 277 ...
## $ votes_for_ballot_measure : num 1807 978 1239 1756 299 ...
Our partners want to understand what happened on election day, please explore the precinct-level data. Precinct-level results for this ballot measure are provided in precinct_level_election_results.csv. Individual-level demographic features for voters registered to vote in these precincts is provided in individual_demographics_and_scores.csv. Please conduct a retrospective analysis of this election. - What factors do you think relate to support for this ballot measure? (use the precinct level aggregates found in precinct_votes_for_ballot_measure and precinct_votes_against_ballot_measure) - What factors do you think relate to turning out to vote in this election? (precinct_total_ballots tells you the number of ballots cast by voters in each precinct, while precinct_registered_voters gives the number of registered voters in the precinct). - What information do you think would aid our partner in voter outreach in a subsequent, similar, election?
Things notice right off the bat.
So for the individual demographics and scores there’s alot of people. I would probably just condense it down. I would probably do something more advance but I would get the average, sd, and also count for each one of these. Average would probably be a better one.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
individual_demographics_and_scores2 <- individual_demographics_and_scores %>% group_by(precinct) %>% summarise(
g_race = mean(probability_race_G, na.rm = TRUE),
p_race = mean(probability_race_P, na.rm = TRUE),
o_race = mean(probability_race_O, na.rm = TRUE),
gender = mean(gender, na.rm = TRUE),
age = mean(age, na.rm = TRUE),
partison_score = mean(partisan_score, na.rm = TRUE),
turnout_score = mean(turnout_score, na.rm = TRUE),
prob_highest_education_high_school = mean(probability_highest_education_high_school, na.rm = TRUE),
support_tax_on_wealthy_score = mean(support_tax_on_wealthy_score, na.rm = TRUE),
support_cannabis_legalization_score = mean(support_cannabis_legalization_score, na.rm = TRUE),
income_over_100k = mean(probability_income_over_100k, na.rm = TRUE),
children_in_household = mean(probability_children_in_household, na.rm = TRUE),
support_trump_score = mean(support_trump_score, na.rm = TRUE)
# Created but not used
#sd_race_g = sd(probability_race_G, na.rm = TRUE),
#sd_race_p = sd(probability_race_P, na.rm = TRUE),
#sd_race_O = sd(probability_race_O, na.rm = TRUE),
#sd_gender = sd(gender, na.rm = TRUE),
#sd_age = sd(age, na.rm = TRUE),
#sd_partisan_score = sd(partisan_score, na.rm = TRUE),
#sd_turnout_score = sd(turnout_score, na.rm = TRUE),
#sd_probability_highest_education_high_school = sd(probability_highest_education_high_school, na.rm = TRUE),
#sd_support_tax_on_wealthy_score = sd(support_tax_on_wealthy_score, na.rm = TRUE),
#sd_support_cannabis_legalization_score = sd(support_cannabis_legalization_score, na.rm = TRUE),
#sd_probability_income_over_100k = sd(probability_income_over_100k, na.rm = TRUE),
#sd_probability_children_in_household = sd(probability_children_in_household, na.rm = TRUE),
#sd_support_trump_score = sd(support_trump_score, na.rm = TRUE)
)
# removing to reduce storage
rm(individual_demographics_and_scores)
# joining datasets
bothjoined <- merge( individual_demographics_and_scores2, precinct_level_election_results, by.x = "precinct", by.y = "precinct") %>% mutate(against_percent = votes_against_ballot_measure/all_ballot_measure_votes,
for_percent = votes_for_ballot_measure/all_ballot_measure_votes, percent_registered_vote = registered_voters/population,
percent_of_registered = all_ballot_measure_votes/total_ballots
)
I am curious now in terms of visualization, the best way to do this is through pairs plots and correlations.
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
my_fn <- function(data, mapping, ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method=lm, fill="blue", color="blue", ...)
p
}
ggpairs(bothjoined %>% select(g_race:age, for_percent:percent_of_registered, turnout_score), lower = list(continuous = my_fn), title="Race, Gender, Age")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
ggpairs(bothjoined %>% select(partison_score:support_cannabis_legalization_score, for_percent:percent_of_registered, turnout_score),lower = list(continuous = my_fn), title="Partison, Turnout, Education, Tax, Marijuana")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
ggpairs(bothjoined %>% select(income_over_100k:support_trump_score, for_percent:percent_of_registered, turnout_score),lower = list(continuous = my_fn), title="Income, Children, Trump")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ purrr 1.0.1
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(corrr)
library(igraph)
##
## Attaching package: 'igraph'
##
## The following objects are masked from 'package:purrr':
##
## compose, simplify
##
## The following object is masked from 'package:tidyr':
##
## crossing
##
## The following object is masked from 'package:tibble':
##
## as_data_frame
##
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
##
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
##
## The following object is masked from 'package:base':
##
## union
library(ggraph)
tidy_cors <- bothjoined %>% select(g_race:support_trump_score,for_percent, percent_registered_vote) %>%
correlate() %>%
stretch()
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
graph_cors <- tidy_cors %>%
filter(abs(r) > .4) %>%
graph_from_data_frame(directed = FALSE)
ggraph(graph_cors) +
geom_edge_link(aes(edge_alpha = abs(r), edge_width = abs(r), color = r)) +
guides(edge_alpha = "none", edge_width = "none") +
scale_edge_colour_gradientn(limits = c(-1, 1), colors = c("firebrick2", "dodgerblue2")) +
geom_node_point(color = "white", size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_graph()
## Using "stress" as default layout
We can address this by looking at the plot above and seeing which tie
in to the for_percent the most.
Using a correlation cut off of .4 or greater for above.
What we care about is for the ballot measure. - positive correlation with Partison score and legalization (probably a leftist ballot) and support tax on wealthy score - negative correlation with trump and age.
For this one, need to get the correct definition of voter turnout. This can be precinct_total_ballots/precinct_registered_voters or precinct_registered_voters/population or this can be interpretted as voting. For argument sake using percent_registered to vote, which created above. The TurnoutScore provided by another provider actually is the strongest one to correlate to this. Examining the above one can see the following.
This is correlated to turnout score - towards the education (more educated more likely to vote) - more towards the income (higher income more likely to vote ) - older more likely to vote (age) - republicans more likely to vote.
Turnout score highly correlated with percent_registered_vote
If you look at the random forest below, you can see that the age, education, children in household, and income are the biggest factors. These all make sense because age and education from above, educated and active, while children in household you care more about state of affairrs and also income have bigger money on line, thus want to be more involved.
Know where to put focus and emphasis on. You would want to spend your political ads on people in the middle (to shift them left to your ad) and slightly to people on left more as an awareness and small activation. So the further left they are the more they are likely to support the iniative, so the less you need to advertise to them. The more center they are the more you need to advertise. So the data here can be used to identify target demographic.
For example, you probably dont want to do older and also more conservative people. You want to advertise towards the middle and left. So for example, here its seen that older people are more likely to vote and also agains the ballot. It would be really imperative to target the young people who would like this iniative and who do not go out to vote. This would be a swing. I would probably also target those that have a lower education because it does look like they have a small correlation and a negative correlation to coming out to target.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:dplyr':
##
## recode
m0 <- lm(for_percent ~ partison_score + support_trump_score + age , data = bothjoined)
summary(m0) # So the reason why the coefficients don't make sense here is because of multicolinearity. Partison score and support trump score are highly correlated impacting the coefficient terms. So removing it (there are lots of ways to control for it, PCA, partial regression, decision tree instead, but thought just removing for now works)
##
## Call:
## lm(formula = for_percent ~ partison_score + support_trump_score +
## age, data = bothjoined)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.305379 -0.035444 0.002662 0.038229 0.282578
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3254032 0.0390801 33.915 <2e-16 ***
## partison_score -0.0028931 0.0003024 -9.566 <2e-16 ***
## support_trump_score -0.0066246 0.0004382 -15.119 <2e-16 ***
## age -0.0046404 0.0003519 -13.187 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07216 on 1124 degrees of freedom
## Multiple R-squared: 0.4703, Adjusted R-squared: 0.4688
## F-statistic: 332.6 on 3 and 1124 DF, p-value: < 2.2e-16
vif(m0)
## partison_score support_trump_score age
## 8.278442 8.629809 1.419890
#### Doing a bunch of permutations with the different variables, it was found that the best variables that kept the coefficients interpreatble, while keeping R square highest was the ones below
m1 <- lm(for_percent ~ support_tax_on_wealthy_score + age +prob_highest_education_high_school, data = bothjoined)
summary(m1)
##
## Call:
## lm(formula = for_percent ~ support_tax_on_wealthy_score + age +
## prob_highest_education_high_school, data = bothjoined)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.268586 -0.030921 -0.000468 0.028365 0.315107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4302965 0.0285656 15.06 <2e-16 ***
## support_tax_on_wealthy_score 0.0083642 0.0003119 26.82 <2e-16 ***
## age -0.0039796 0.0002963 -13.43 <2e-16 ***
## prob_highest_education_high_school -0.0032773 0.0001581 -20.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06288 on 1124 degrees of freedom
## Multiple R-squared: 0.5976, Adjusted R-squared: 0.5966
## F-statistic: 556.5 on 3 and 1124 DF, p-value: < 2.2e-16
vif(m1)
## support_tax_on_wealthy_score age
## 1.67446 1.32569
## prob_highest_education_high_school
## 1.30639
For interpret ability and action, the regression above achieves the highest r square while keeping variables coefficients somewhat accurate.So the problem with where the regression fails is that there are lot of variables that are correlated and you need to check assumption. You can do multicolinearity controls or you can use more advance models to tell the impact the variables have
I used a random forest as a cross reference check, because a random forest is non parametric. There’s non linear component to it something that measures economic affinity or something This and PCA or multifactor modeling would help tease some things out btu this would be side research. KISS - keep it super ismple.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(varImp)
## Loading required package: measures
## Loading required package: party
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
##
## Attaching package: 'modeltools'
## The following object is masked from 'package:car':
##
## Predict
## The following object is masked from 'package:igraph':
##
## clusters
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
##
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
##
## boundary
model <- randomForest(for_percent ~ partison_score + support_trump_score + age + support_cannabis_legalization_score + income_over_100k + gender + turnout_score + prob_highest_education_high_school + support_tax_on_wealthy_score + children_in_household, data = bothjoined, importance=TRUE)
importance(model)
## %IncMSE IncNodePurity
## partison_score 20.16037 1.4898947
## support_trump_score 20.39430 2.0651338
## age 21.99424 0.9117849
## support_cannabis_legalization_score 23.05373 1.9483734
## income_over_100k 26.46150 0.9902511
## gender 22.03205 0.5026855
## turnout_score 20.69239 0.4553624
## prob_highest_education_high_school 26.49490 0.9300000
## support_tax_on_wealthy_score 23.20174 1.1347139
## children_in_household 23.36897 0.4225857
varImpPlot(model)
This is what the random forest says for the biggest variables that impact the for campaign. In terms of direction, look at the correlation matrix from above.
model2 <- randomForest(turnout_score ~ partison_score + support_trump_score + age + support_cannabis_legalization_score + income_over_100k + gender + prob_highest_education_high_school + support_tax_on_wealthy_score + children_in_household, data = bothjoined, importance=TRUE)
importance(model2)
## %IncMSE IncNodePurity
## partison_score 18.19551 8138.228
## support_trump_score 15.48424 2867.535
## age 52.80181 13702.113
## support_cannabis_legalization_score 20.42998 4895.203
## income_over_100k 31.65407 19061.838
## gender 23.36440 3085.459
## prob_highest_education_high_school 39.71093 22025.812
## support_tax_on_wealthy_score 21.45176 14645.960
## children_in_household 36.02076 5962.028
varImpPlot(model2)
Look at the top 4 variables for impact for voter turnout.
ballot_measure_poll %>% group_by(support_initiative) %>% tally()
ggplot(data = ballot_measure_poll, aes(x = region, fill = support_initiative)) + geom_bar(position = "fill") + ggtitle("Region breakdown") + ylab("proportion")
north region has about 10% more yesses
ggplot(data = ballot_measure_poll, aes(x = education, fill = support_initiative)) + geom_bar(position = "fill") + ggtitle("Education breakdown") + ylab("proportion")
post graduate more likely to support
ggplot(data = ballot_measure_poll, aes(x = ses, fill = support_initiative)) + geom_bar(position = "fill")+ ylab("proportion") + ggtitle("Sess breakdown") # the wealthier the more they oppose to the initiave.
ggplot(data = ballot_measure_poll, aes(x = ideology, fill = support_initiative)) + geom_bar(position = "fill")+ ylab("proportion") + ggtitle("ideology breakdown")
you can see that the initiative is very polarized. Conversatives don’t
support while liberals do
ggplot(data = ballot_measure_poll, aes(x = kids, fill = support_initiative)) + geom_bar(position = "fill")+ ylab("proportion") + ggtitle("kids breakdown")
Kids don’t really affect the initive except for the people that don’t
know if they have kids
ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_authoritarianism_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_authoritarianism_score") # you can see a difference between the yes and the no. those with lower score are more likely to say yes
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_taxes_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_taxes_score") # those with a higher tax score more likely to say yes
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_gunrights_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_gunrights_score") # those that have lower score more likely to say yes
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_healthcare_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_healthcare_score") # those more likely to have higher healthcare score more likely to say yes
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_immigrants_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_immigrants_score") # immigration score has an impact
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_supporting_the_poor_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_supporting_the_poor_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), environmentalist_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("environmentalist_score")
## Warning: Removed 57 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), trust_in_institutions_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("trust_in_institutions_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), economic_populist_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("economic_populist_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_military_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_military_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_immigrants_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_immigrants_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_supporting_the_poor_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_supporting_the_poor_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), environmentalist_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("environmentalist_score")
## Warning: Removed 57 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), trust_in_institutions_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("trust_in_institutions_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), economic_populist_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("economic_populist_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_military_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_military_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_regulation_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_regulation_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), traditionalist_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("traditionalist_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), compassionate_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("compassionate_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_free_trade_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_free_trade_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_globalism_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_globalism_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_healthcare_women_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_healthcare_women_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_populism_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_populism_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), presidential_election_turnout_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("presidential_election_turnout_score")
## Warning: Removed 144 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), racial_resentment_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("racial_resentment_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_religious_freedom_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_religious_freedom_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).
model3 <- randomForest(support_initiative ~ kids + ideology + ethnicity + ses + education + pro_authoritarianism_score + pro_taxes_score + pro_gunrights_score + pro_healthcare_score + pro_immigrants_score + pro_supporting_the_poor_score +
environmentalist_score + trust_in_institutions_score + economic_populist_score + pro_military_score + pro_regulation_score + traditionalist_score + compassionate_score + pro_free_trade_score + pro_globalism_score + pro_healthcare_women_score + pro_populism_score + presidential_election_turnout_score + racial_resentment_score + pro_religious_freedom_score , data = ballot_measure_poll %>% mutate(support_initiative = as.factor(support_initiative)) %>% na.omit(), importance=TRUE)
importance(model3)
## dont_know no yes
## kids -1.1194597 0.07733537 0.04364887
## ideology -0.1269588 19.93174694 19.77030611
## ethnicity -1.2787456 -1.20583668 2.61488550
## ses 1.0694766 -0.65736654 1.52352705
## education 0.7274660 -0.04487890 1.45821377
## pro_authoritarianism_score -4.4620069 7.53224881 6.27183314
## pro_taxes_score -2.0302918 1.81429163 8.60576884
## pro_gunrights_score -2.4327574 0.56480653 8.66763527
## pro_healthcare_score -2.4629323 4.43552881 9.54698186
## pro_immigrants_score -5.7593275 2.69825398 13.75682067
## pro_supporting_the_poor_score -2.7508420 4.12645125 6.63006463
## environmentalist_score -4.2763134 -1.05031702 9.44607546
## trust_in_institutions_score -1.0010015 -1.41625318 -1.45860804
## economic_populist_score -1.0328451 1.21196674 6.81716061
## pro_military_score -5.4240427 4.53879710 7.23478673
## pro_regulation_score -3.5670624 -0.09689486 7.24648622
## traditionalist_score -6.1877662 1.23901842 13.37928226
## compassionate_score -0.4480175 4.56035339 7.40479584
## pro_free_trade_score -2.6183819 -0.72839974 9.41918130
## pro_globalism_score -5.0047052 -0.75046600 12.38866988
## pro_healthcare_women_score -3.5551173 0.12778516 9.65121720
## pro_populism_score -3.0239938 2.56743261 7.10037601
## presidential_election_turnout_score -1.2209358 4.75628577 10.99726842
## racial_resentment_score -3.4365494 0.95923996 13.80944339
## pro_religious_freedom_score -2.4471525 1.90083969 8.99902604
## MeanDecreaseAccuracy MeanDecreaseGini
## kids -0.09221867 5.1488610
## ideology 27.65257816 31.3879340
## ethnicity 1.15006375 5.2273769
## ses 0.96647836 15.1946352
## education 1.27324192 12.4294274
## pro_authoritarianism_score 11.61420552 22.7781532
## pro_taxes_score 9.35860983 20.2756725
## pro_gunrights_score 7.95302187 25.7435747
## pro_healthcare_score 11.76805568 18.6109935
## pro_immigrants_score 15.10442023 20.6473256
## pro_supporting_the_poor_score 9.13858641 23.1518617
## environmentalist_score 8.30576485 9.2104605
## trust_in_institutions_score -1.78703160 0.7279759
## economic_populist_score 7.72457411 23.7077570
## pro_military_score 9.09507735 19.4037318
## pro_regulation_score 6.48754952 17.4306239
## traditionalist_score 13.03994118 18.3979080
## compassionate_score 10.60685707 25.4668593
## pro_free_trade_score 8.48538902 16.4223647
## pro_globalism_score 10.66151048 18.1453743
## pro_healthcare_women_score 10.01295504 20.8931286
## pro_populism_score 7.58373553 19.2750274
## presidential_election_turnout_score 13.26342028 32.1040624
## racial_resentment_score 15.58860157 21.8430805
## pro_religious_freedom_score 11.05598220 19.4607859
varImpPlot(model3)
These list the variables and their impact.
● Which score(s) would you recommend that the partners use for doing voter outreach so the ballot measure will be successful?
It seems a good chunk of these have some sort of predictive power. Based off visuals. I would rank them b the ones that the random forest sorted like: Looks like ideology, pro immigrants score,
● How should our partner use this score?
You should use the scores to asssess the liklihood that a candidate will vote and als understand the character profile of this candidate. Based off those characteristics and profile target them
We can create a machine learning classifier to rank the probability that a person will vote yes. We can target those that are 50% or above. You can see how easy it would be to change opinion on certain variables and try weighted targetting based off that
● How would you convey the value of using the targeting strategy you recommend?
I would compare the results that we have now with the strategy that was recomended and see if this would change. The delta I would attribute to the strategy and the targetting efforts.